89 research outputs found
Clustering Arabic Tweets for Sentiment Analysis
The focus of this study is to evaluate the impact of linguistic preprocessing and similarity functions on clustering Arabic tweets. The experiments apply an optimized version of the standard K-Means algorithm to assign tweets to positive and negative categories. The results show that root-based stemming has a significant advantage over light stemming in all settings. The Averaged Kullback-Leibler Divergence similarity function clearly outperforms the Cosine, Pearson Correlation, Jaccard Coefficient and Euclidean functions. The combination of the Averaged Kullback-Leibler Divergence and root-based stemming achieved the highest purity of 0.764, while the second-best purity was 0.719. These results are important because they contrast with normal-sized documents where, in many information retrieval applications, light stemming performs better than root-based stemming and the Cosine function is commonly used.
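The two quantities named in this abstract, a symmetric averaged Kullback-Leibler divergence and cluster purity, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact divergence formula, so the Jensen-Shannon form (equal weights of 0.5 against the midpoint distribution) is swapped in as a stand-in, and the function names and toy labels are hypothetical.

```python
import math
from collections import Counter


def averaged_kl(p, q):
    """Symmetric averaged KL divergence between two probability vectors.

    Assumption: Jensen-Shannon form, 0.5*KL(p||m) + 0.5*KL(q||m) with
    m = (p + q) / 2. The paper's exact weighting may differ.
    """
    total = 0.0
    for a, b in zip(p, q):
        m = 0.5 * (a + b)
        if a > 0:  # terms with zero probability contribute nothing
            total += 0.5 * a * math.log(a / m)
        if b > 0:
            total += 0.5 * b * math.log(b / m)
    return total


def purity(cluster_ids, gold_labels):
    """Fraction of items that carry the majority gold label of their cluster."""
    by_cluster = {}
    for cluster, label in zip(cluster_ids, gold_labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(gold_labels)


# Toy data (hypothetical): six tweets, two clusters.
clusters = [0, 0, 0, 1, 1, 1]
gold = ["pos", "pos", "neg", "neg", "neg", "pos"]
toy_purity = purity(clusters, gold)  # 4 of 6 items match their cluster majority
```

Purity rewards clusters dominated by a single class, which is why it suits a two-way positive/negative evaluation such as the one described here.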
Towards a Cloud-Based Ontology for Service Model Security -- Technical Report
The adoption of cloud computing has brought significant advancements in the
operational models of businesses. However, this shift also brings new security
challenges by expanding the attack surface. The offered services in cloud
computing have various service models. Each cloud service model has a defined
responsibility divided based on the stack layers between the service user and
their cloud provider. Regardless of its service model, each service is
constructed from sub-components and services running on the underlying layers.
In this paper, we aim to enable more transparency and visibility by designing
an ontology that links the provider's services with the sub-components used to
deliver the service. Such a breakdown of each cloud service's sub-components
enables the end user to track vulnerabilities at the service level or in one
of its sub-components. Such information can result in a better understanding
and management of reported vulnerabilities at the sub-component level and
their impact on the services offered by the cloud provider. Our ontology and
source code are published as open source and are accessible via GitHub:
https://github.com/mohkharma/cc-ontology
Comment: 8 pages
ArBanking77: Intent Detection Neural Model and a New Dataset in Modern and Dialectical Arabic
This paper presents ArBanking77, a large Arabic dataset for intent
detection in the banking domain. Our dataset was arabized and localized from
the original English Banking77 dataset, which consists of 13,083 queries, into
the ArBanking77 dataset with 31,404 queries in both Modern Standard Arabic (MSA)
and Palestinian dialect, with each query classified into one of the 77 classes
(intents). Furthermore, we present a neural model, based on AraBERT, fine-tuned
on ArBanking77, which achieved an F1-score of 0.9209 and 0.8995 on MSA and
Palestinian dialect, respectively. We performed extensive experimentation in
which we simulated low-resource settings, where the model is trained on a
subset of the data and augmented with noisy queries to simulate colloquial
terms, mistakes and misspellings found in real NLP systems, especially live
chat queries. The data and the models are publicly available at
https://sina.birzeit.edu/arbanking77
Offensive Hebrew Corpus and Detection using BERT
Offensive language detection has been well studied in many languages, but it
is lagging behind in low-resource languages, such as Hebrew. In this paper, we
present a new offensive language corpus in Hebrew. A total of 15,881 tweets
were retrieved from Twitter. Each was labeled with one or more of five classes
(abusive, hate, violence, pornographic, or non-offensive) by Arabic-Hebrew
bilingual speakers. The annotation process was challenging as each annotator is
expected to be familiar with the Israeli culture, politics, and practices to
understand the context of each tweet. We fine-tuned two Hebrew BERT models,
HeBERT and AlephBERT, using our proposed dataset and another published dataset.
We observed that our data boosts HeBERT performance by 2% when combined with
D_OLaH. Fine-tuning AlephBERT on our data and testing on D_OLaH yields 69%
accuracy, while fine-tuning on D_OLaH and testing on our data yields 57%
accuracy, which may be an indication of the generalizability our data offers.
Our dataset and fine-tuned models are available on GitHub and Huggingface.
Comment: 8 pages, 1 figure, The 20th ACS/IEEE International Conference on
Computer Systems and Applications (AICCSA
Extracting Synonyms from Bilingual Dictionaries
We present our progress in developing a novel algorithm to extract synonyms
from bilingual dictionaries. Identification and usage of synonyms play a
significant role in improving the performance of information access
applications. The idea is to construct a translation graph from translation
pairs, then to extract and consolidate cyclic paths to form bilingual sets of
synonyms. The initial evaluation of this algorithm illustrates promising
results in extracting Arabic-English bilingual synonyms. In the evaluation, we
first converted the synsets in the Arabic WordNet into translation pairs (i.e.,
losing word-sense memberships). Next, we applied our algorithm to rebuild these
synsets. We compared the original and extracted synsets obtaining an F-Measure
of 82.3% and 82.1% for Arabic and English synset extraction, respectively.
Comment: In Proceedings - 11th International Global Wordnet Conference
(GWC2021). Global Wordnet Association (2021
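The idea of building a translation graph and consolidating cyclic paths into bilingual synonym sets can be illustrated with a small sketch. It is a simplification under stated assumptions and not the authors' algorithm: it only looks for cycles of length four (two source words that share at least two translations), it merges overlapping cycles with a union-find, and the function name and the transliterated word pairs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def extract_synsets(pairs):
    """Group words that lie on a common 4-cycle of the translation graph.

    pairs: iterable of (source_word, target_word) translation pairs.
    Returns a sorted list of (source_synonyms, target_synonyms) tuples.
    """
    translations = defaultdict(set)
    for src, tgt in pairs:
        translations[src].add(tgt)

    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    def union(x, y):
        parent[find(x)] = find(y)

    # Two source words sharing >= 2 translations close a cycle
    # src_a - tgt_1 - src_b - tgt_2 - src_a in the bipartite graph.
    for a, b in combinations(sorted(translations), 2):
        shared = translations[a] & translations[b]
        if len(shared) >= 2:
            nodes = [("src", a), ("src", b)] + [("tgt", t) for t in sorted(shared)]
            for node in nodes[1:]:
                union(nodes[0], node)

    # Consolidate: every union-find component becomes one bilingual synset.
    groups = defaultdict(lambda: (set(), set()))
    for node in list(parent):
        side, word = node
        src_set, tgt_set = groups[find(node)]
        (src_set if side == "src" else tgt_set).add(word)
    return sorted((sorted(s), sorted(t)) for s, t in groups.values())


# Hypothetical transliterated Arabic-English pairs.
pairs = [("bayt", "house"), ("bayt", "home"),
         ("manzil", "house"), ("manzil", "home"),
         ("kitab", "book")]
synsets = extract_synsets(pairs)
```

In this toy input, "bayt" and "manzil" share two translations and so end up in one synset with "home" and "house", while "kitab"/"book" lies on no cycle and is left out.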
SALMA: Arabic Sense-Annotated Corpus and WSD Benchmarks
SALMA, the first Arabic sense-annotated corpus, consists of ~34K tokens,
which are all sense-annotated. The corpus is annotated using two different
sense inventories simultaneously (Modern and Ghani). SALMA's novelty lies in how
tokens and senses are associated. Instead of linking a token to only one
intended sense, SALMA links a token to multiple senses and provides a score to
each sense. A smart web-based annotation tool was developed to support scoring
multiple senses against a given word. In addition to sense annotations, we also
annotated the corpus using six types of named entities. The quality of our
annotations was assessed using various metrics (Kappa, Linear Weighted Kappa,
Quadratic Weighted Kappa, Mean Average Error, and Root Mean Square Error),
which show very high inter-annotator agreement. To establish a Word Sense
Disambiguation baseline using our SALMA corpus, we developed an end-to-end Word
Sense Disambiguation system using Target Sense Verification. We used this
system to evaluate three Target Sense Verification models available in the
literature. Our best model achieved an accuracy of 84.2% using Modern and
78.7% using Ghani. The full corpus and the annotation tool are open-source and
publicly available at https://sina.birzeit.edu/salma/
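The linear and quadratic weighted Kappa metrics listed in this abstract can be computed as in the sketch below. It assumes that the two annotators' sense scores are integers on an ordinal scale 0..k-1, which the abstract does not state; the function name and the toy scores are hypothetical, and the formula is the standard weighted Cohen's Kappa rather than anything taken from the SALMA code.

```python
def weighted_kappa(ratings_a, ratings_b, num_categories, power=2):
    """Weighted Cohen's Kappa for two annotators.

    power=1 gives linear weights |i - j| / (k - 1);
    power=2 gives quadratic weights ((i - j) / (k - 1)) ** 2.
    Ratings are assumed to be integers in range(num_categories).
    """
    n = len(ratings_a)
    k = num_categories
    observed = [[0] * k for _ in range(k)]
    hist_a, hist_b = [0] * k, [0] * k
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1
        hist_a[a] += 1
        hist_b[b] += 1

    disagreement_observed = 0.0
    disagreement_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (abs(i - j) / (k - 1)) ** power
            disagreement_observed += weight * observed[i][j]
            # Expected count if the two annotators rated independently.
            disagreement_expected += weight * hist_a[i] * hist_b[j] / n

    if disagreement_expected == 0:
        return 1.0  # both annotators used a single identical category
    return 1.0 - disagreement_observed / disagreement_expected


# Toy scores (hypothetical): four tokens, three score levels.
annotator_1 = [0, 1, 2, 2]
annotator_2 = [0, 1, 2, 1]
qwk = weighted_kappa(annotator_1, annotator_2, num_categories=3, power=2)
lwk = weighted_kappa(annotator_1, annotator_2, num_categories=3, power=1)
```

Quadratic weights penalize large score gaps more heavily than linear weights, so the same single near-miss disagreement yields a higher quadratic than linear Kappa on this toy input.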
Nabra: Syrian Arabic Dialects with Morphological Annotations
This paper presents Nabra, corpora of Syrian Arabic dialects with
morphological annotations. A team of Syrian natives collected more than 6K
sentences containing about 60K words from several sources including social
media posts, scripts of movies and series, lyrics of songs and local proverbs
to build Nabra. Nabra covers several local Syrian dialects including those of
Aleppo, Damascus, Deir-ezzur, Hama, Homs, Huran, Latakia, Mardin, Raqqah, and
Suwayda. A team of nine annotators annotated the 60K tokens with full
morphological annotations across sentence contexts. We trained the annotators
to follow methodological annotation guidelines to ensure unique morpheme
annotations, and normalized the annotations. F1 and kappa agreement scores
ranged between 74% and 98% across features, showing the excellent quality of
Nabra annotations. Our corpora are open-source and publicly available as part
of the Currasat portal https://sina.birzeit.edu/currasat
WojoodNER 2023: The First Arabic Named Entity Recognition Shared Task
We present WojoodNER-2023, the first Arabic Named Entity Recognition (NER)
Shared Task. The primary focus of WojoodNER-2023 is on Arabic NER, offering
novel NER datasets (i.e., Wojood) and the definition of subtasks designed to
facilitate meaningful comparisons between different NER approaches.
WojoodNER-2023 encompassed two Subtasks: FlatNER and NestedNER. A total of 45
unique teams registered for this shared task, with 11 of them actively
participating in the test phase. Specifically, 11 teams participated in
FlatNER, while teams tackled NestedNER. The winning teams achieved F1
scores of 91.96 and 93.73 in FlatNER and NestedNER, respectively.
The carrying angle: racial differences and relevance to inter-epicondylar distance of the humerus
The human carrying angle (CA) is a measure of the lateral deflection of the forearm from the arm. The importance of this angle emerges from its functional and clinical relevance. Previous studies have correlated this angle with different parameters including age, gender, and handedness. However, no reports have focused on race-dependent variations in CA or its relation to various components of the elbow joint. This study aimed to investigate the variations in CA with respect to race and inter-epicondylar distance (IED) of the humerus. The study included 457 Jordanian and 345 Malaysian volunteers with an age range of 18–21 years. All participants were right-hand dominant with no previous medical history in their upper limbs. Both CA and IED were measured by well-trained medical practitioners according to a well-established protocol. Regardless of race, CA was greater on the dominant side and in females. Furthermore, CA was significantly greater in Malaysian males compared to Jordanian males, and significantly smaller in Malaysian females compared to their Jordanian counterparts. Finally, CA significantly decreased with increasing IED in both races. This study supports effects of gender and handedness on the CA independent of race. However, CA also varies with race, and this variation is independent of age, gender, and handedness. The evaluation also revealed an inverse relationship between CA and IED. These findings indicate that multiple factors including race and IED should be considered during the examination and management of elbow fractures and epicondylar diseases